{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "## all imports\n",
    "from IPython.display import HTML\n",
    "import numpy as np\n",
    "import urllib2\n",
    "import bs4 #this is beautiful soup\n",
    "import time\n",
    "import operator\n",
    "import socket\n",
    "import cPickle\n",
    "import re # regular expressions\n",
    "\n",
    "from pandas import Series\n",
    "import pandas as pd\n",
    "from pandas import DataFrame\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set_context(\"talk\")\n",
    "sns.set_style(\"white\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "API registrations\n",
    "=================\n",
    "\n",
    "If you would like to run all the examples in this notebook, you need to register for the following APIs:\n",
    "\n",
    "* Rotten Tomatoes\n",
    "\n",
    "http://developer.rottentomatoes.com/member/register\n",
    "\n",
    "* Twitter\n",
    "\n",
    "https://apps.twitter.com/app/new\n",
    "\n",
    "* Twitter instructions\n",
    "\n",
    "https://twittercommunity.com/t/how-to-get-my-api-key/7033"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "CS109\n",
    "=====\n",
    "\n",
    "Verena Kaynig-Fittkau, Joe Blitzstein, Hanspeter Pfister\n",
    "\n",
    "* vkaynig@seas.harvard.edu\n",
    "* staff@cs109.org"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Announcements\n",
    "==============\n",
    "\n",
    "* Over 400 sign ups on github!\n",
    "* If you are still missing, fill out the survey, time is running out!\n",
    "* Make sure you are on Piazza!\n",
    "\n",
    "\n",
    "* More [git help](https://www.youtube.com/channel/UC0-KaiZFXBlGOFN71YsEV8g/videos)\n",
    "\n",
    "\n",
    "* HW0 is due on Thursday \n",
    "* HW1 is coming out on Thursday\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Todays lecture:\n",
    "===============\n",
    "\n",
    "* introduction to pandas\n",
    "    - read a table\n",
    "    - do some plots\n",
    "\n",
    "* all about data scraping\n",
    "* ***What is it? ***\n",
    "* How to do it:\n",
    "    - from a website\n",
    "    - with an API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "IPython Notebooks:\n",
    "===================\n",
    "\n",
    "![IPython](images/ipython.png \"IPython\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "IPython Notebooks:\n",
    "===================\n",
    "\n",
    "* These slides are an IPython notebook!\n",
    "* https://github.com/damianavila/live_reveal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"Hello CS109\"\n",
    "\n",
    "print \"I love IPython\"\n",
    "\n",
    "# Ipython notebook have tab completion!\n",
    "# and inbuild help\n",
    "  \n",
    "a = np.zeros(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "General advice about programming\n",
    "==================================\n",
    "\n",
    "* You will find nearly everything on google\n",
    "* Try: length of a list in python\n",
    "* A programmer is someone who can turn stack overflow snippets into running code\n",
    "* Use tab completion\n",
    "* Make your variable names meaningful\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "How to load a table\n",
    "===================\n",
    "\n",
    "* we use Pandas for this\n",
    "* Pandas can do a __lot__ more\n",
    "* more about it later"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The MovieLens data\n",
    "===================\n",
    "\n",
    "http://grouplens.org/datasets/movielens/\n",
    "\n",
    "![Grouplens](images/grouplens.jpg \"Grouplens\")\n",
    "\n",
    "Example inspired by [Greg Reda](http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Read the user data\n",
    "=================="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# pass in column names for each CSV\n",
    "u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']\n",
    "\n",
    "users = pd.read_csv(\n",
    "    'http://files.grouplens.org/datasets/movielens/ml-100k/u.user', \n",
    "    sep='|', names=u_cols)\n",
    "\n",
    "users.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Read the ratings\n",
    "============"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']\n",
    "ratings = pd.read_csv(\n",
    "    'http://files.grouplens.org/datasets/movielens/ml-100k/u.data', \n",
    "    sep='\\t', names=r_cols)\n",
    "\n",
    "ratings.head() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Now data about the movies\n",
    "========================="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# the movies file contains columns indicating the movie's genres\n",
    "# let's only load the first five columns of the file with usecols\n",
    "m_cols = ['movie_id', 'title', 'release_date', \n",
    "            'video_release_date', 'imdb_url']\n",
    "\n",
    "movies = pd.read_csv(\n",
    "    'http://files.grouplens.org/datasets/movielens/ml-100k/u.item', \n",
    "    sep='|', names=m_cols, usecols=range(5))\n",
    "\n",
    "movies.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Get information about data\n",
    "======================="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print movies.dtypes\n",
    "print\n",
    "print movies.describe()\n",
    "# *** Why only those two columns? ***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Selecting data\n",
    "==============\n",
    "\n",
    "* DataFrame => group of Series with shared index\n",
    "* single DataFrame column => Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users.head()\n",
    "users['occupation'].head()\n",
    "## *** Where did the nice design go? ***\n",
    "columns_you_want = ['occupation', 'sex'] \n",
    "users[columns_you_want].head()\n",
    "\n",
    "print users.head()\n",
    "\n",
    "print users.iloc[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Filtering data\n",
    "==============\n",
    "\n",
    "Select users older than 25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "oldUsers = users[users.age > 25]\n",
    "oldUsers.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz:\n",
    "=====\n",
    "\n",
    "* show users aged 40 and male\n",
    "\n",
    "* show the mean age of female programmers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# users aged 40 AND male\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## users who are female and programmers\n",
    "# your code here\n",
    "\n",
    "## show statistic summary or compute mean\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Split-apply-combine\n",
    "===================\n",
    "\n",
    "* splitting the data into groups based on some criteria\n",
    "* applying a function to each group independently\n",
    "* combining the results into a data structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Split-apply-combine\n",
    "===================\n",
    "\n",
    "<img src=http://i.imgur.com/yjNkiwL.png></img>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Find Diligent Users\n",
    "===================\n",
    "\n",
    "* split data per user ID\n",
    "* count ratings\n",
    "* combine result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "print ratings.head()\n",
    "## split data\n",
    "grouped_data = ratings.groupby('user_id')\n",
    "#grouped_data = ratings['movie_id'].groupby(ratings['user_id'])\n",
    "\n",
    "## count and combine\n",
    "ratings_per_user = grouped_data.count()\n",
    "\n",
    "ratings_per_user.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz\n",
    "====\n",
    "\n",
    "* get the average rating per movie\n",
    "* advanced: get the movie titles with the highest average rating"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## split data\n",
    "# your code here\n",
    "\n",
    "## average and combine\n",
    "# your code here\n",
    "\n",
    "# get the maximum rating\n",
    "# your code here\n",
    "\n",
    "# get movie ids with that rating\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "print \"Good movie ids:\"\n",
    "print #your code here\n",
    "print\n",
    "\n",
    "print \"Best movie titles\"\n",
    "print # your code here\n",
    "print"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# get number of ratings per movie\n",
    "# your code here\n",
    "\n",
    "print \"Number of ratings per movie\"\n",
    "print # your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Passing a Function\n",
    "==================\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "average_ratings = grouped_data.apply(lambda f: f.mean())\n",
    "average_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz\n",
    "====\n",
    "\n",
    "* get the average rating per user\n",
    "* advanced: list all occupations and if they are male or female dominant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# get the average rating per user\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# list all occupations and if they are male or female dominant\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "print 'number of male users: '\n",
    "print sum(users['sex'] == 'M')\n",
    "\n",
    "print 'number of female users: '\n",
    "print sum(users['sex'] == 'F')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Pandas \"wrapup\"\n",
    "==========\n",
    "\n",
    "- create data frames\n",
    "- get sub-frames\n",
    "- filter data \n",
    "- use group-by\n",
    "- apply a user defined function\n",
    "\n",
    "\n",
    "![cute panda](images/cute_panda.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Python data scraping\n",
    "====================\n",
    "\n",
    "* Why scrape the web?\n",
    "    - vast source of information\n",
    "    - automate tasks\n",
    "    - keep up with sites\n",
    "    - fun!\n",
    "\n",
    "** Can you think of examples ? **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Read and Tweet!\n",
    "=================\n",
    "\n",
    "![ReadTweet](http://developer.nytimes.com/files/readtweet.jpg)\n",
    "\n",
    "* by Justin Blinder\n",
    "* http://projects.justinblinder.com/We-Read-We-Tweet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "“We Read, We Tweet” geographically visualizes the dissemination of New York Times articles through Twitter. Each line connects the location of a tweet to the contextual location of the New York Times article it referenced. The lines are generated in a sequence based on the time in which a tweet occurs. The project explores digital news distribution in a temporal and spatial context through the social space of Twitter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Twitter Sentiments\n",
    "=================\n",
    "\n",
    "![TwitterSentiments](http://www.csc.ncsu.edu/faculty/healey/tweet_viz/figs/tweet-viz-ex.png\n",
    " \"Twitter Sentiments\")\n",
    "\n",
    "* by Healey and Ramaswamy\n",
    "* http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Type a keyword into the input field, then click the Query button. Recent tweets that contain your keyword are pulled from Twitter and visualized in the Sentiment tab as circles. Hover your mouse over a tweet or click on it to see its text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Python data scraping\n",
    "====================\n",
    "\n",
    "* copyrights and permission:\n",
    "    - be careful and polite\n",
    "    - give credit\n",
    "    - care about media law\n",
    "    - don't be evil (no spam, overloading sites, etc.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Robots.txt\n",
    "==========\n",
    "\n",
    "![Robots.txt](images/robots_txt.jpg \"Robots.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Robots.txt\n",
    "==========\n",
    "\n",
    "* specified by web site owner\n",
    "* gives instructions to web robots (aka your script)\n",
    "* is located at the top-level directory of the web server\n",
    "\n",
    "http://www.example.com/robots.txt\n",
    "\n",
    "If you want you can also have a look at\n",
    "\n",
    "http://google.com/robots.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Robots.txt\n",
    "==========\n",
    "\n",
    "*** What does this one do? ***"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "\n",
    "User-agent: Google\n",
    "Disallow:\n",
    "\n",
    "User-agent: *\n",
    "Disallow: /"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Things to consider:\n",
    "-------------------\n",
    "\n",
    "* can be just ignored\n",
    "* can be a security risk - *** Why? ***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping with Python:\n",
    "=====================\n",
    "\n",
    "* scraping is all about HTML tags\n",
    "* bad news: \n",
    "    - need to learn about tags\n",
    "    - websites can be ugly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "HTML\n",
    "=====\n",
    "\n",
    "* HyperText Markup Language\n",
    "\n",
    "* standard for creating webpages\n",
    "\n",
    "* HTML tags \n",
    "    - have angle brackets\n",
    "    - typically come in pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "This is an example for a minimal webpage defined in HTML tags. The root tag is `<html>` and then you have the `<head>` tag. This part of the page typically includes the title of the page and might also have other meta information like the author or keywords that are important for search engines. The `<body>` tag marks the actual content of the page. You can play around with the `<h2>` tag trying different header levels. They range from 1 to 6. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "htmlString = \"\"\"<!DOCTYPE html>\n",
    "<html>\n",
    "  <head>\n",
    "    <title>This is a title</title>\n",
    "  </head>\n",
    "  <body>\n",
    "    <h2> Test </h2>\n",
    "    <p>Hello world!</p>\n",
    "  </body>\n",
    "</html>\"\"\"\n",
    "\n",
    "htmlOutput = HTML(htmlString)\n",
    "htmlOutput"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Useful Tags\n",
    "===========\n",
    "\n",
    "* heading\n",
    "`<h1></h1> ... <h6></h6>`\n",
    "\n",
    "* paragraph\n",
    "`<p></p>` \n",
    "\n",
    "* line break\n",
    "`<br>` \n",
    "\n",
    "* link with attribute\n",
    "\n",
    "`<a href=\"http://www.example.com/\">An example link</a>`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping with Python:\n",
    "=====================\n",
    "\n",
    "* example of a beautifully simple webpage:\n",
    "\n",
    "http://www.crummy.com/software/BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping with Python:\n",
    "=====================\n",
    "\n",
    "* good news: \n",
    "    - some browsers help\n",
    "    - look for: inspect element\n",
    "    - need only basic html\n",
    "    \n",
    "** Try 'Ctrl-Shift I' in Chrome **\n",
    "\n",
    "** Try 'Command-Option I' in Safari **\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping with Python\n",
    "==================\n",
    "\n",
    "* different useful libraries:\n",
    "    - urllib\n",
    "    - beautifulsoup\n",
    "    - pattern\n",
    "    - soupy\n",
    "    - LXML\n",
    "    - ...\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The following cell just defines a url as a string and then reads the data from that url using the `urllib` library. If you uncomment the print command you see that we got the whole HTML content of the page into the string variable source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "url = 'http://www.crummy.com/software/BeautifulSoup'\n",
    "source = urllib2.urlopen(url).read()\n",
    "print source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz :\n",
    "======\n",
    "\n",
    "* Is the word 'Alice' mentioned on the beautiful soup homepage?\n",
    "* How often does the word 'Soup' occur on the site?\n",
    "    - hint: use `.count()`\n",
    "* At what index occurs the substring 'alien video games' ?\n",
    "    - hint: use `.find()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## is 'Alice' in source?\n",
    "\n",
    "## count occurences of 'Soup'\n",
    "\n",
    "## find index of 'alien video games'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Beautiful Soup\n",
    "==============\n",
    "\n",
    "* designed to make your life easier\n",
    "* many good functions for parsing html code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Some examples\n",
    "=============\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## get bs4 object\n",
    "soup = bs4.BeautifulSoup(source)\n",
    " \n",
    "## compare the two print statements\n",
    "#print soup\n",
    "#print soup.prettify()\n",
    "\n",
    "## show how to find all a tags\n",
    "soup.findAll('a')\n",
    "\n",
    "## ***Why does this not work? ***\n",
    "#soup.findAll('Soup')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Some examples\n",
    "============="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## get attribute value from an element:\n",
    "## find tag: this only returns the first occurrence, not all tags in the string\n",
    "first_tag = soup.find('a')\n",
    "\n",
    "## get attribute `href`\n",
    "first_tag.get('href')\n",
    "\n",
    "## get all links in the page\n",
    "link_list = [l.get('href') for l in soup.findAll('a')]\n",
    "link_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## filter all external links\n",
    "# create an empty list to collect the valid links\n",
    "external_links = []\n",
    "\n",
    "# write a loop to filter the links\n",
    "# if it starts with 'http' we are happy\n",
    "for l in link_list:\n",
    "    if l[:4] == 'http':\n",
    "        external_links.append(l)\n",
    "\n",
    "# this throws an error! It says something about 'NoneType'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# lets investigate. Have a close look at the link_list:\n",
    "link_list\n",
    "\n",
    "# Seems that there are None elements!\n",
    "# Let's verify\n",
    "#print sum([l is None for l in link_list])\n",
    "\n",
    "# So there are two elements in the list that are None!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Let's filter those objects out in the for loop\n",
    "external_links = []\n",
    "\n",
    "# write a loop to filter the links\n",
    "# if it is not None and starts with 'http' we are happy\n",
    "for l in link_list:\n",
    "    if l is not None and l[:4] == 'http':\n",
    "        external_links.append(l)\n",
    "        \n",
    "external_links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Note: The above `if` condition works because of lazy evaluation in Python. The `and` statement becomes `False` if the first part is `False`, so there is no need to ever evaluate the second part. Thus a `None` entry in the list gets never asked about its first four characters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# and we can put this in a list comprehension as well, it almost reads like \n",
    "# a sentence.\n",
    "\n",
    "[l for l in link_list if l is not None and l.startswith('http')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Parsing the Tree\n",
    "================\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# redifining `s` without any line breaks\n",
    "s = \"\"\"<!DOCTYPE html><html><head><title>This is a title</title></head><body><h3> Test </h3><p>Hello world!</p></body></html>\"\"\"\n",
    "## get bs4 object\n",
    "tree = bs4.BeautifulSoup(s)\n",
    "\n",
    "## get html root node\n",
    "root_node = tree.html\n",
    "\n",
    "## get head from root using contents\n",
    "head = root_node.contents[0]\n",
    "\n",
    "## get body from root\n",
    "body = root_node.contents[1]\n",
    "\n",
    "## could directly access body\n",
    "tree.body"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz:\n",
    "=====\n",
    "\n",
    "* Find the `h3` tag by parsing the tree starting at `body`\n",
    "* Create a list of all __Hall of Fame__ entries listed on the Beautiful Soup webpage\n",
    "    - hint: it is the only unordered list in the page (tag `ul`)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## get h3 tag from body\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## use ul as entry point\n",
    "\n",
    "\n",
    "## get hall of fame list from entry point\n",
    "## skip the first entry \n",
    "\n",
    "## reformat into a list containing strings\n",
    "## it is ok to have a list of lists"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "`tmp` now is actually a list of lists containing the hall of fame entries. \n",
    "Here is some advanced Python on how to print really just one entry per list item.\n",
    "\n",
    "The cool things about this are: \n",
    "* The use of `\"\"` to just access the `join` function of strings.\n",
    "* The `join` function itself\n",
    "* that you can actually have two nested for loops in a list comprehension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "test =  [\"\".join(str(a) for a in sublist) for sublist in tmp]\n",
    "print '\\n'.join(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Advanced Example\n",
    "===============\n",
    "\n",
    "Idea by [Jesse Steinweg-Woods](https://jessesw.com/Data-Science-Skills/)\n",
    "--------------------------------------------------------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping data science skills\n",
    "=============================\n",
    "\n",
    "- What skills are in demand for data scientists?\n",
    "- Should we have a lecture on Spark or only on MapReduce?\n",
    "\n",
    "We want to scrape the information from job advertisements for data scientists from indeed.com\n",
    "Let's scrape and find out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Fixed url for job postings containing data scientist\n",
    "url = 'http://www.indeed.com/jobs?q=data+scientist&l='\n",
    "# read the website\n",
    "source = urllib2.urlopen(url).read()\n",
    "# parse html code\n",
    "bs_tree = bs4.BeautifulSoup(source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# see how many job postings we found\n",
    "job_count_string = bs_tree.find(id = 'searchCount').contents[0]\n",
    "job_count_string = job_count_string.split()[-1]\n",
    "print(\"Search yielded %s hits.\" % (job_count_string))\n",
    "\n",
    "# not that job_count so far is still a string, \n",
    "# not an integer, and the , separator prevents \n",
    "# us from just casting it to int\n",
    "\n",
    "job_count_digits = [int(d) for d in job_count_string if d.isdigit()]\n",
    "job_count = np.sum([digit*(10**exponent) for digit, exponent in \n",
    "                    zip(job_count_digits[::-1], range(len(job_count_digits)))])\n",
    "\n",
    "print job_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# The website is only listing 10 results per page, \n",
    "# so we need to scrape them page after page\n",
    "num_pages = int(np.ceil(job_count/10.0))\n",
    "\n",
    "base_url = 'http://www.indeed.com'\n",
    "job_links = []\n",
    "for i in range(1): #do range(num_pages) if you want them all\n",
    "    if i%10==0:\n",
    "        print num_pages-i\n",
    "    url = 'http://www.indeed.com/jobs?q=data+scientist&start=' + str(i*10)\n",
    "    html_page = urllib2.urlopen(url).read() \n",
    "    bs_tree = bs4.BeautifulSoup(html_page)\n",
    "    job_link_area = bs_tree.find(id = 'resultsCol')\n",
    "    job_postings = job_link_area.findAll(\"div\")\n",
    "    job_postings = [jp for jp in job_postings if not jp.get('class') is None \n",
    "                    and ''.join(jp.get('class')) ==\"rowresult\"]\n",
    "    job_ids = [jp.get('data-jk') for jp in job_postings]\n",
    "    \n",
    "    # go after each link\n",
    "    for id in job_ids:\n",
    "        job_links.append(base_url + '/rc/clk?jk=' + id)\n",
    "\n",
    "    time.sleep(1)\n",
    "\n",
    "print \"We found a lot of jobs: \", len(job_links)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Some precautions to enable us to restart our search\n",
    "========================="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# Save the scraped links\n",
    "#with open('data/scraped_links.pkl', 'wb') as f:\n",
    "#    cPickle.dump(job_links, f)\n",
    "    \n",
    "# Read canned scraped links\n",
    "with open('data/scraped_links.pkl', 'r') as f:\n",
    "    job_links = cPickle.load(f)    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "skill_set = {'mapreduce': 0, 'spark': 0}\n",
    "\n",
    "## write initialization into a file, so we can restart later\n",
    "#with open('scraped_links_restart.pkl', 'wb') as f:\n",
    "#    cPickle.dump((skill_set, 0),f)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Python Dictonaries\n",
    "==================\n",
    "\n",
    "* build in data type\n",
    "* uses key: value pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "a = {'a': 1, 'b':2}\n",
    "print a\n",
    "\n",
    "#show keys\n",
    "print a.keys()\n",
    "\n",
    "#show values\n",
    "print a.values()\n",
    "\n",
    "#show for loop over all entries\n",
    "# option 1 using zip\n",
    "# this works also for iterating over any\n",
    "# other two lists\n",
    "for k,v in zip(a.keys(), a.values()):\n",
    "    print k,v\n",
    "\n",
    "# option 2 using the dictionary `iteritems()` function\n",
    "for k,v in a.iteritems():\n",
    "    print k,v"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# This code below does the trick, but could be optimized for speed if necessary\n",
    "# e.g. skills are typically listed at the end of the webpage\n",
    "# might not need to split/join the whole webpage, as we already know\n",
    "# which words we are looking for \n",
    "# and could stop after the first occurance of each word\n",
    "\n",
    "with open('data/scraped_links_restart.pkl', 'r') as f:\n",
    "    skill_set, index = cPickle.load(f)\n",
    "    print \"How many websites still to go? \", len(job_links) - index\n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "counter = 0\n",
    "\n",
    "for link in job_links[index:]:\n",
    "    counter +=1  \n",
    "    \n",
    "    try:\n",
    "        html_page = urllib2.urlopen(link).read()\n",
    "    except urllib2.HTTPError:\n",
    "        print \"HTTPError:\"\n",
    "        continue\n",
    "    except urllib2.URLError:\n",
    "        print \"URLError:\"\n",
    "        continue\n",
    "    except socket.error as error:\n",
    "        print \"Connection closed\"\n",
    "        continue\n",
    "\n",
    "    html_text = re.sub(\"[^a-z.+3]\",\" \", html_page.lower()) # replace all but the listed characters\n",
    "        \n",
    "    for key in skill_set.keys():\n",
    "        if key in html_text:  \n",
    "            skill_set[key] +=1\n",
    "            \n",
    "    if counter % 5 == 0:\n",
    "        print len(job_links) - counter - index\n",
    "        print skill_set\n",
    "        with open('scraped_links_restart.pkl','wb') as f:\n",
    "            cPickle.dump((skill_set, index+counter),f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "print skill_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "pseries = pd.Series(skill_set)\n",
    "pseries.sort(ascending=False)\n",
    "\n",
    "pseries.plot(kind = 'bar')\n",
    "## set the title to Score Comparison\n",
    "plt.title('Data Science Skills')\n",
    "## set the x label\n",
    "plt.xlabel('Skills')\n",
    "## set the y label\n",
    "plt.ylabel('Count')\n",
    "## show the plot\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Another Example\n",
    "================\n",
    "\n",
    "Designed by Katharine Jarmul\n",
    "----------------------------\n",
    "\n",
    "https://github.com/kjam/python-web-scraping-tutorial\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Scraping Happy Hours\n",
    "====================\n",
    "\n",
    "Scrape the happy hour list of LA for personal preferences\n",
    "\n",
    "http://www.downtownla.com/3_10_happyHours.asp?action=ALL\n",
    "\n",
    "This example is part of her talk about data scraping at PyCon2014. She is a really good speaker and I enjoyed watching her talk. Check it out: http://www.youtube.com/watch?v=p1iX0uxM1w8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "stuff_i_like = ['burger', 'sushi', 'sweet potato fries', 'BBQ','beer']\n",
    "found_happy_hours = []\n",
    "my_happy_hours = []\n",
    "# First, I'm going to identify the areas of the page I want to look at\n",
    "url = 'http://www.downtownla.com/3_10_happyHours.asp?action=ALL'\n",
    "source = urllib2.urlopen(url).read()\n",
    "tables = bs4.BeautifulSoup(source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Then, I'm going to sort out the *exact* parts of the page\n",
    "# that match what I'm looking for...\n",
    "for t in tables.findAll('p', {'class': 'calendar_EventTitle'}):\n",
    "    text = t.text\n",
    "    for s in t.findNextSiblings():\n",
    "        text += '\\n' + s.text\n",
    "    found_happy_hours.append(text)\n",
    "\n",
    "print \"The scraper found %d happy hours!\" % len(found_happy_hours)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Now I'm going to loop through the food I like\n",
    "# and see if any of the happy hour descriptions match\n",
    "for food in stuff_i_like:\n",
    "    for hh in found_happy_hours:\n",
    "        # checking for text AND making sure I don't have duplicates\n",
    "        if food in hh and hh not in my_happy_hours:\n",
    "            print \"YAY! I found some %s!\" % food\n",
    "            my_happy_hours.append(hh)\n",
    "\n",
    "print \"I think you might like %d of them, yipeeeee!\" % len(my_happy_hours)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Now, let's make a mail message we can read:\n",
    "message = 'Hey Katharine,\\n\\n\\n'\n",
    "message += 'OMG, I found some stuff for you in Downtown, take a look.\\n\\n'\n",
    "message += '==============================\\n'.join(my_happy_hours)\n",
    "message = message.encode('utf-8')\n",
    "# To read more about encoding:\n",
    "# http://diveintopython.org/xml_processing/unicode.html\n",
    "message = message.replace('\\t', '').replace('\\r', '')\n",
    "message += '\\n\\nXOXO,\\n Your Py Script'\n",
    "\n",
    "#print message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Getting Data with an API\n",
    "=========================\n",
    "\n",
    "* API: application programming interface\n",
    "* some sites try to make your life easier\n",
    "* Twitter, New York Times, ImDB, rotten Tomatoes, Yelp, ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Rotten Tomatoes\n",
    "===============\n",
    "\n",
    "![The Wizard of Oz](images/wiz_oz.png \"The wizard of Oz\")\n",
    "\n",
    "http://www.rottentomatoes.com/top/\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "API keys\n",
    "=========\n",
    "\n",
    "* required for data access\n",
    "* identifies application (you)\n",
    "* monitors usage\n",
    "* limits rates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Rotten Tomatoes Key\n",
    "===================\n",
    "\n",
    "http://developer.rottentomatoes.com/member/register"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import requests\n",
    "\n",
    "api_key = rottenTomatoes_key()\n",
    "\n",
    "url = 'http://api.rottentomatoes.com/api/public/v1.0/lists/dvds/top_rentals.json?apikey=' + api_key\n",
    "data = urllib2.urlopen(url).read()\n",
    "#print data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "JSON\n",
    "======\n",
    "\n",
    "* JavaScript Object Notation\n",
    "* human readable\n",
    "* transmit attribute-value pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "a = {'a': 1, 'b':2}\n",
    "s = json.dumps(a)\n",
    "a2 = json.loads(s)\n",
    "\n",
    "## a is a dictionary\n",
    "print a\n",
    "## vs s is a string containing a in JSON encoding\n",
    "print s\n",
    "## reading back the keys are now in unicode\n",
    "print a2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## create dictionary from JSON \n",
    "dataDict = json.loads(data)\n",
    "\n",
    "## expore dictionary\n",
    "print dataDict.keys()\n",
    "\n",
    "## there is a key named `movies` containing a list of movies as a value\n",
    "movies = dataDict['movies']\n",
    "\n",
    "## each element of the list `movies` is a dictionary\n",
    "print movies[0].keys()\n",
    "\n",
    "## one of the keys is called `ratings`\n",
    "## the value is yet another dictionary\n",
    "print movies[0]['ratings'].keys()\n",
    "\n",
    "## so we made it all the way to find the critics score\n",
    "print movies[0]['ratings']['critics_score']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Quiz\n",
    "=====\n",
    "\n",
    "* build a list with critics scores\n",
    "* build a list with audience scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# critics scores list\n",
    "\n",
    "\n",
    "# audience scores list\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The following code shows how to create a pandas data frame with the data we gathered from the webpage.\n",
    "Beware of the `set_index()` function in pandas. Per default it does not change the actual data frame! You need to either reassign the output or set the `inplace` argument to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## create pandas data frame with critics and audience score\n",
    "scores = pd.DataFrame(data=[critics_scores, audience_scores]).transpose()\n",
    "scores.columns = ['critics', 'audience']\n",
    "\n",
    "## also create a list with all movie titles\n",
    "movie_titles = [m['title'] for m in movies]\n",
    "\n",
    "## set index of dataFrame BEWARE of inplace!\n",
    "scores.set_index([movie_titles])\n",
    "\n",
    "## the line above does not changes scores!\n",
    "## You need to either reassign\n",
    "\n",
    "scores = scores.set_index([movie_titles])\n",
    "\n",
    "## or set the inplace argument to True\n",
    "scores.set_index([movie_titles], inplace=True)\n",
    "scores.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## create a bar plot with the data\n",
    "## notice that we are using the data frame itself and call its plot function\n",
    "scores.plot(kind = 'bar')\n",
    "\n",
    "## set the title to Score Comparison\n",
    "plt.title('Score Comparison')\n",
    "\n",
    "## set the x label\n",
    "plt.xlabel('Movies')\n",
    "\n",
    "## set the y label\n",
    "plt.ylabel('Scores')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## show the plot\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Twitter Example:\n",
    "================\n",
    "\n",
    "* API a bit more complicated\n",
    "* libraries make life easier\n",
    "* python-twitter\n",
    "\n",
    "https://github.com/bear/python-twitter\n",
    "\n",
    "What we are going to do is scrape Joe's twitter account, and then filter it for the interesting tweets. Defining interesting as tweets that have be re-tweeted at least 10 times. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import twitter\n",
    "\n",
    "## define the necessary keys\n",
    "cKey = twitterAPI_key()\n",
    "cSecret = twitterAPI_secret()\n",
    "aKey = twitterAPI_access_token_key()\n",
    "aSecret = twitterAPI_access_token_secret()\n",
    "\n",
    "## create the api object with the twitter-python library\n",
    "api = twitter.Api(consumer_key=cKey, consumer_secret=cSecret, access_token_key=aKey, access_token_secret=aSecret)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## get the user timeline with screen_name = 'stat110'\n",
    "twitter_statuses = api.GetUserTimeline(screen_name = 'stat110')\n",
    "\n",
    "## create a data frame\n",
    "## first get a list of panda Series or dict\n",
    "pdSeriesList = [pd.Series(t.AsDict()) for t in twitter_statuses]\n",
    "\n",
    "## then create the data frame\n",
    "data = pd.DataFrame(pdSeriesList)\n",
    "\n",
    "data.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "## filter tweets with enough retweet_count\n",
    "maybe_interesting = data[data.retweet_count>20]\n",
    "\n",
    "## get the text of these tweets\n",
    "tweet_text = maybe_interesting.text\n",
    "\n",
    "## print them out\n",
    "text = tweet_text.values\n",
    "\n",
    "for t in text:\n",
    "    print '######'\n",
    "    print t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Extracting columns:\n",
    "===================\n",
    "\n",
    "__Warning:__ The returned column `tweet_text` is a `view` on the data\n",
    "    \n",
    "* it is not a copy\n",
    "* you change the Series => you change the DataFrame\n",
    "\n",
    "Below is another example of such a view:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "## create a view for favorite_count on maybe_interesting\n",
    "view = maybe_interesting['favorite_count']\n",
    "print '-----------------'\n",
    "print \"This is view:\"\n",
    "print view\n",
    "## change a value\n",
    "view[8] = 9999\n",
    "\n",
    "## look at original frame\n",
    "print '-----------------'\n",
    "print \"This is view after changing view[8]\"\n",
    "print view\n",
    "\n",
    "print '-----------------'\n",
    "print \"This is maybe_interesting after changing view[8]\"\n",
    "print \"It changed too!\"\n",
    "print maybe_interesting['favorite_count']\n",
    "\n",
    "## to avoid this you can use copy\n",
    "independent_data = maybe_interesting['favorite_count'].copy()\n",
    "independent_data[10] = 999\n",
    "print '-----------------'\n",
    "print \"This is independent_data after changed at 10:\"\n",
    "print independent_data\n",
    "print '-----------------'\n",
    "print \"This is maybe_interesting after changing independent_data:\"\n",
    "print \"It did not change because we only changed a copy of it\"\n",
    "print maybe_interesting['favorite_count']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "What we covered today:\n",
    "============\n",
    "\n",
    "* Pandas data frames\n",
    "* Guidelines for friendly scraping\n",
    "* Scraping html sites\n",
    "* Scraping with Api's\n",
    "* Basic data cleanup\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Further material\n",
    "================\n",
    "\n",
    "* I highly recommend Katharine Jarmul's scraping tutorials\n",
    "* For example [this one](https://www.youtube.com/watch?v=p1iX0uxM1w8)\n",
    "* Pandas has extensive [documentation](http://pandas.pydata.org/pandas-docs/stable/)\n",
    "* Especially the [tem minutes to pandas chapter](http://pandas.pydata.org/pandas-docs/stable/10min.html)\n",
    "\n",
    "* [Greg Reda](http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/) did a lot more pandas examples for the movie lens data set"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
